Importing Required Packages

Importing and understanding the dataset

Null values in columns Education_Level and Marital_Status

No duplicate rows in the dataset
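The two checks above (null values and duplicate rows) could be reproduced with a sketch like the one below. The toy frame and its values are made up for illustration. Only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 40],
    "Education_Level": ["Graduate", np.nan, "High School", "Graduate"],
    "Marital_Status": ["Married", "Single", np.nan, np.nan],
})

null_counts = df.isnull().sum()        # number of nulls per column
n_duplicates = df.duplicated().sum()   # number of fully duplicated rows

print(null_counts[null_counts > 0])
print("duplicate rows:", n_duplicates)
```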

Observations

  1. Customer_Age has a mean of 46 with a min of 26 and a max of 73
  2. Dependent_count has a min of zero, meaning some customers have no dependents. A max of 5 dependents is at the high end and may be an outlier. With only a handful of distinct values, this column should be treated as categorical.
  3. Months_on_book has a mean of 36 months with the minimum being 13 months. Max is 56 months, which implies the oldest relationship is ~4.7 years old.
  4. Total_Relationship_Count should be categorical.
  5. Months_Inactive_12_mon has a max of 6 months, which implies being inactive for half the year.
  6. Contacts_Count_12_mon has a median (50th percentile) of 2, which implies half of the customers were contacted at most twice in 12 months.
  7. Credit_Limit has a minimum of 1,438 and a max of 34,516. Seems like a broad range.
  8. Total_Revolving_Bal has a mean of 1,162.
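  9. Avg_Open_To_Buy has a max of 34,516, equal to the maximum Credit_Limit, which suggests at least one customer has the full limit available, i.e. no outstanding balance.
  10. Total_Amt_Chng_Q4_Q1 has a min of 0, which means there are customers who did not use the card in Q4. The mean is 0.76, so, reading the column as the ratio of Q4 to Q1 transaction amount, Q4 spend is on average about 76% of Q1 spend.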
  11. Total_Trans_Amt ranges from 500 to 18K.
  12. Total_Trans_Ct ranges from 10 to 139.
  13. Total_Ct_Chng_Q4_Q1 has a min of 0, which means there are customers who did not use the card in Q4. The mean is 0.71, so, reading the column as the ratio of Q4 to Q1 transaction count, the Q4 count is on average about 71% of the Q1 count.
  14. Avg_Utilization_Ratio has a min of 0, which means some customers carry no balance on the card.

Exploratory Data Analysis

Univariate Analysis

Attrition_Flag

There are two types of customers: existing customers and attrited customers. A large proportion of the customers have not churned, so the classes are imbalanced. We need to upsample the churned customers in the training data so that both classes are equally represented.
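A quick way to quantify the imbalance is to look at class shares. The sketch below uses made-up labels in a 84/16 split, matching the 16% churn share reported later in this notebook. The exact label strings are an assumption.

```python
import pandas as pd

# Made-up labels; the label strings are assumed, not taken from the data file.
attrition = pd.Series(
    ["Existing Customer"] * 84 + ["Attrited Customer"] * 16,
    name="Attrition_Flag",
)

class_share = attrition.value_counts(normalize=True)      # share of each class
imbalance_ratio = class_share.max() / class_share.min()   # majority / minority

print(class_share)
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```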

Customer_Age

Customer_Age looks approximately normally distributed, with a couple of outlier points that need to be reviewed.
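One common way to flag such points is the 1.5 × IQR rule used by box plots. This is a sketch on made-up ages, and the notebook may have used a different criterion.

```python
import pandas as pd

# Made-up ages spanning the min (26) and max (73) reported earlier.
ages = pd.Series([26, 38, 41, 44, 46, 48, 51, 54, 60, 73], name="Customer_Age")

q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whiskers are candidates for review, not automatic removal.
outliers = ages[(ages < lower) | (ages > upper)]
print(f"bounds: [{lower}, {upper}]")
print(outliers.tolist())
```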

Gender

Dependent_count

The number of dependents ranges from none to 5, with 27% of the customers having 3 dependents, followed by those who have 2.

Education_Level

30.9% of the customers are Graduates, followed by those who finished high school.
15% of the customers have a missing education level, which needs to be filled in.

Marital_Status

46.3% of the customers are married and 38.9% are single.
There are some customers whose Marital Status is unknown. These need to be filled in.

Income_Category

35.2% of the customers earn less than 40K with 17.7% earning between 40K and 60K.
Some customers have the placeholder value 'abc' as their Income_Category. This needs to be treated as a missing value.

Card_Category

Most customers (93.2%) have the Blue credit card, followed by Silver, Gold and Platinum.

Months_on_book

Months_on_book is highly skewed and needs to be processed before modelling.

Total_Relationship_Count

22.8% of customers hold 3 products with the bank, followed by 18.9% holding 4 products.

Months_Inactive_12_mon

38% of customers have been inactive for 3 months out of 12.
32.4% of customers have been inactive for 2 months out of 12.

Contacts_Count_12_mon

33.4% of customers were contacted 3 times in the last 12 months.
31.9% of customers were contacted 2 times in the last 12 months.

Credit_Limit

Total_Revolving_Bal

Avg_Open_To_Buy

Total_Amt_Chng_Q4_Q1

Total_Trans_Amt

Total_Trans_Ct

Total_Ct_Chng_Q4_Q1

Avg_Utilization_Ratio

Bivariate Analysis

Performing bivariate analysis with Attrition_Flag since it is the target variable

High correlation between the following columns:

  1. Customer_Age and Months_on_book
  2. Avg_Open_To_Buy and Credit_Limit
  3. Total_Trans_Ct and Total_Trans_Amt
  4. Total_Revolving_Bal and Avg_Utilization_Ratio
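Pairs like these can be pulled out of the correlation matrix programmatically. The sketch below uses synthetic data in which Avg_Open_To_Buy is constructed as Credit_Limit minus Total_Revolving_Bal. That construction and the 0.8 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
credit_limit = rng.uniform(1500, 34000, n)
revolving = rng.uniform(0, 1500, n)

# Synthetic stand-in for the numeric columns (not the real data).
toy = pd.DataFrame({
    "Credit_Limit": credit_limit,
    "Total_Revolving_Bal": revolving,
    "Avg_Open_To_Buy": credit_limit - revolving,
    "Customer_Age": rng.integers(26, 74, n),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is listed once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().sort_values(ascending=False)
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```

Highly correlated pairs carry largely the same information, so one column of each pair is a candidate for removal before modelling.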

Observations from EDA

  1. 16% of the customers have churned
  2. 52% of the customer base is Female
  3. 27% of the customers have 3 dependents
  4. 31% of the customers are graduates
  5. 46.3% of customers are married
  6. 39% of customers are single
  7. 35.2% of customers earn less than 40K
  8. 93% of customers have the blue credit card
  9. 23% of customers hold 3 products
  10. Total Revolving Balance is lower for Attrited customers
  11. Total Transaction Count is lower for Attrited customers
  12. Utilization Ratio is lower for Attrited customers

Data Preparation

Encoding Target Variable
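A minimal sketch of the encoding, assuming the two labels are spelled "Existing Customer" and "Attrited Customer". Attrition is mapped to 1 so that it is the positive class when recall is computed later.

```python
import pandas as pd

# Made-up rows; the label strings are assumed.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
})

# 1 = churned (positive class), 0 = still a customer.
df["Attrition_Flag"] = df["Attrition_Flag"].map(
    {"Existing Customer": 0, "Attrited Customer": 1}
)
print(df["Attrition_Flag"].tolist())
```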

Feature Engineering

Split Data
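The sketch below shows one way to split into train, validation and test sets while keeping the churn share the same in each part. The 60/20/20 proportions and the toy data are assumptions, and the notebook's actual split may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 16 churned (1) and 84 existing (0) customers.
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 16 + [0] * 84)

# First carve out the test set, then split the remainder into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=1
)

print(len(X_train), len(X_val), len(X_test))
```

Splitting before imputation and resampling keeps information from the validation and test rows out of the training steps.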

Missing Value Treatment

Education_Level and Marital_Status have null values.
Income_Category contains the placeholder value 'abc', which needs to be treated as missing.
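One simple treatment is to turn 'abc' into a missing value and fill all three columns with the most frequent category, learned from the training split only. This is a sketch under that assumption. The notebook may have used a different imputation strategy, and the category strings below are illustrative.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "Graduate", "High School"],
    "Marital_Status": ["Married", "Married", np.nan, "Single"],
    "Income_Category": ["Less than $40K", "abc", "Less than $40K", "$40K - $60K"],
})
test = pd.DataFrame({
    "Education_Level": [np.nan, "High School"],
    "Marital_Status": ["Single", np.nan],
    "Income_Category": ["abc", "$40K - $60K"],
})
cols = ["Education_Level", "Marital_Status", "Income_Category"]

# Treat the 'abc' placeholder as missing in both splits.
for frame in (train, test):
    frame["Income_Category"] = frame["Income_Category"].mask(
        frame["Income_Category"] == "abc"
    )

# Learn the fill values on the training split only, then apply them everywhere.
fill_values = {col: train[col].mode()[0] for col in cols}
train = train.fillna(fill_values)
test = test.fillna(fill_values)
print(fill_values)
```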

Education_Level

Marital_Status

Income_Category

Creating Dummy Variables
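A sketch of one-hot encoding with `pd.get_dummies`, assuming the encoding is done after the split. The toy values are made up. The `reindex` step aligns the test columns to the training columns, since a split can be missing a category or contain an unseen one.

```python
import pandas as pd

X_train = pd.DataFrame({
    "Credit_Limit": [1500, 9000, 20000, 3000],
    "Card_Category": ["Blue", "Silver", "Gold", "Blue"],
})
X_test = pd.DataFrame({
    "Credit_Limit": [2500, 34000],
    "Card_Category": ["Blue", "Platinum"],
})

X_train_enc = pd.get_dummies(X_train, drop_first=True)

# Align to the training columns: missing dummies become 0, unseen ones are dropped.
X_test_enc = pd.get_dummies(X_test, drop_first=True).reindex(
    columns=X_train_enc.columns, fill_value=0
)
print(X_train_enc.columns.tolist())
```

Note that a category seen only in the test split (Platinum here) ends up encoded like the dropped baseline category, which is a known limitation of this approach.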

Model Evaluation Criteria

Two Scenarios of Losses:

  1. A customer leaves the bank when the prediction was that they will stay.
  2. A customer stays with the bank when the prediction was that they will leave.

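Which is worse for the bank?
I think (1) is worse in this case: the bank is already looking for ways to retain customers, and a customer who is predicted to stay but then leaves never receives a retention offer. Scenario (1) is a false negative, so the metric of importance here is recall.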
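In confusion-matrix terms, with attrition as the positive class, scenario 1 is a false negative and scenario 2 is a false positive. Recall, TP / (TP + FN), is penalised by false negatives, and precision, TP / (TP + FP), by false positives. The sketch below computes both on made-up predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = attrited, 0 = existing.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)

print(f"FN (scenario 1): {fn}, FP (scenario 2): {fp}")
print(f"recall={recall:.2f}, precision={precision:.2f}")
```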

Model Building

Gradient Boosting (82.5%) followed by AdaBoost (81%) seem to be the best models based on the above scores.
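A comparison of this kind can be sketched with cross-validated recall. The data below is synthetic (`make_classification`), the model list and the 5-fold setup are assumptions, and the printed numbers will not match the scores quoted in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the prepared training data.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.84, 0.16], random_state=1
)

models = {
    "Random Forest": RandomForestClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_recall = {
    name: cross_val_score(model, X, y, scoring="recall", cv=cv).mean()
    for name, model in models.items()
}

for name, score in sorted(cv_recall.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```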

Model Building - Oversampled Data

Oversampling Training Data using SMOTE

Random Forest (97.1%) followed by Gradient Boosting (96%) seem to be the best models based on the above scores.

Model Building - Undersampled Data

Undersampling Training Data using Random Under Sampler

Gradient Boosting (94%) followed by Random Forest (93.7%) seem to be the best models based on the above scores.

Model Selection

I will choose the following 3 models to go ahead with the project:

  1. Random Forest with oversampled data
  2. Gradient Boosting with oversampled data
  3. Gradient Boosting with undersampled data

Hyperparameter Tuning

Random Forest with Oversampled Training Data

The recall is lower than the cross-validation recall, but the model still seems to be performing well.
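A tuning run of this kind could be sketched with `RandomizedSearchCV` scored on recall, followed by a check on held-out data. The search space, `n_iter`, the fold count and the synthetic data are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the prepared data.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Hypothetical search space, kept small so the sketch runs quickly.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="recall",
    cv=3,
    random_state=1,
)
search.fit(X_train, y_train)

val_recall = recall_score(y_val, search.predict(X_val))
print("best params:", search.best_params_)
print(f"CV recall: {search.best_score_:.3f}, validation recall: {val_recall:.3f}")
```

One possible reason for a gap between the two numbers is that, when cross-validation is run on data that was oversampled beforehand, synthetic rows derived from a fold's held-out points can end up in that fold's training part, which tends to inflate the cross-validation score.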

Gradient Boosting with Oversampled Training Data

Gradient Boosting with Undersampled Training Data

Comparing all Models

Performance on the test set

Performance on the test data shows that the model generalises well.

Pipelines for Productionizing the Model
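A production pipeline typically bundles imputation, encoding and the final model into one object, so that the same steps are applied at training and prediction time. The sketch below is one possible layout using scikit-learn's `Pipeline` and `ColumnTransformer`. The column subset, the imputation strategies, the choice of Gradient Boosting as the final model and the toy data (including the rule that generates the labels) are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["Total_Trans_Ct", "Total_Revolving_Bal"]
categorical_cols = ["Education_Level", "Marital_Status"]

# Toy data with made-up values and some missing education levels.
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "Total_Trans_Ct": rng.integers(10, 140, n),
    "Total_Revolving_Bal": rng.integers(0, 2500, n).astype(float),
    "Education_Level": rng.choice(["Graduate", "High School", "Uneducated"], n),
    "Marital_Status": rng.choice(["Married", "Single"], n),
})
X.loc[::10, "Education_Level"] = np.nan
y = (X["Total_Trans_Ct"] < 50).astype(int)  # toy rule, not the real target

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", GradientBoostingClassifier(random_state=1)),
])
model.fit(X, y)
preds = model.predict(X)
print(preds[:10])
```

Because the preprocessing lives inside the pipeline, raw rows with missing or previously unseen categories can be passed straight to `model.predict`.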